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Abstract. We describe a parallel version of our tree-code for the simulation of self-gravitating systems in 
Astrophysics. It is based on a dynamic and adaptive method for the domain decomposition, which exploits 
the hierarchical data arrangement used by the tree-code. It shows low computational costs for the parallelization 
overhead - less than 4% of the total CPU-time in the tests done - because the domain decomposition is performed 
'on the fly' during the tree setting and the portion of the tree that is local to each processor 'enriches' itself of 
remote data only when they are actually needed. The performances of an implementation of the parallel code 
on a Cray T3E are presented and discussed. They exhibit a very good behaviour of the speedup (= 15 with 16 
processors and 10 5 particles) and a rather low load unbalancing (< 10% using up to 16 processors), achieving a 
high computation speed in the forces evaluation (> 10 4 particles/sec with 8 processors). 
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1. Introduction 



Tree-codes (Barnes & Hut 1986, hereafter BH; Hernquist 



1987) are particle algorithms extensively employed in the 



simulations of large self-gravitating astrophysical systems. 
They have the capability to speed up the numerical evalu- 
ation of the gravitational interactions among the N bodies 
of the system. Of course, the parallelization of the algo- 
rithm aims at attaining larger and larger N in the numer- 
ical representation of the real system; this is important 
not only to improve the spatial resolution, but also to get 
much more meaningful results, because a too low number 
of 'virtual' particles in comparison with the number of real 
bodies, gives rise to a unphysical shortening of the 2-body 
collisions time. 

In general, an efficient parallelization means a data dis- 
tribution to the processors, the so called domain decom- 
position (DD), so as to i) distribute the numerical work 
as uniformly as possible, ii) minimize the data exchange 
among the processors (hereafter PEs). Of course, this lat- 
ter point is relevant only on distributed memory platform. 
Moreover, such DD should be performed with a minimal 
computational cost. 

In the numerical evaluation of the gravitational inter- 
actions it is difficult to deal with these tasks, because the 
long-range nature of gravity makes data transfer among 



PEs unavoidable. Furthermore, self-gravitating systems 
have often non-uniform mass distributions, that give rise 
to very inhomogeneous distributions of the work-load (the 
amount of calculations needed to evaluate the accelera- 
tion of a particle). This implies that the DD should be 
weighted, in some way, according to the work-load. 

Finally, the hierarchical arrangement of the subsets of 
the mass distribution which the tree-code is based on, im- 
plies that most of the computations for evaluating the ac- 
celeration of a particle regard the evaluation of the force 
due to close bodies. This suggests a spatial DD: each do- 
main should be enclosed in a volume as contiguous and 
compact as possible. 

At present, one of the most used approach for the DD 
is th e orthogon al rec ursive bisection (Wa rren & Salmon 
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Dubinski 1996; Lia & Carraro [2000t Springel et al 
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2000 ), which consists in a recursive subdivision of the 
space in pairs of sub-domains equally weighted. On ev- 
ery sub-domain, the owning PE builds (independently of 
the others) its local tree data structure. Such structure is 
then enlarged enclosing those data, belonging to remote 
trees, that are needed to the local forces evaluation. 

The main disadvantage of this approach is that the 
retrieval of remote data is complicated (and computation- 
ally expensive, too) mainly because of the lack of an ad- 
dressing reference scheme common to all th e PEs . The 
'hashed oct-tree' method (Warren & Salmon 1995 , here- 
after WS) solves this problem, but a certain implemen- 
tation complexity still remains and some radical changes 
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are required in the way the tree arrangement is usually 
stored. For this reason we decided to carry out a new and 
easy to implement method for sharing efficiently the com- 
putational domain among PEs in a distributed memory 
architecture. 

This paper is organized as follows. In Sect. || we de- 
scribe the differences of our tree-code in respect with the 
original method illustrated in BH, both from the general 
point of view of the algorithm and in connection with 
the parallelization approach. This latter is described in 
Sect. |^ for both the stages the tree-code is made up of. 
Finally, the performances of a PGHPF (Portland Group 
High Performances Fortran) implementation running on a 
Cray T3E computer are discussed in Sect. ||. 

2. Our version of the BH tree-code 

In this Section we describe some modifications of the orig- 
inal BH tree-code, which are also important (as we will 
see later) from the point of view of the parallelization 
technique. They regard the construction of the tree ar- 
rangement. The reader who is not familiar with the basi- 
cal features of tree-codes, can found detailed descriptions 
in Warren & Salmon ( 1 9 9 2| ) : Hernquist (1987); Hernquist 
& Katz J1989D ; Springel et al. ( |2000[ ). 

Let us give some definitions that may differ from those 
used by other authors: i) the boxes are the cubes that make 
up the hierarchical structure (arranged as an octal tree 
graph) built during the tree-setting stage by subdividing 
recursively the root box enclosing all the system, ii) the 
root is at the 0-th subdivision level and iii) a parent box, 
at the l-th level, is a box which includes more than one 
particle and which is splitted into 8 cubic sub-boxes at the 
(/+l)-th subdivision level, iv) the terminal boxes are those 
with just one particle inside and v) the tree-traversal is the 
phase in which all the particles accelerations are evaluated 
by "ascending" the tree from the root upward. 

We adopted an internal memory representation of the 
tree structure that makes use of pointers, i.e. integers 
pointing to the locations in which the data of the sub- 
boxes have been stored. This allows to accede recursively 
to boxes' data with a 0(\ogN) order of operations and, 
moreover, it permits, as we will see, to complete easily and 
with a minimal communications overhead the portion of 
the tree initially assigned to a given PE, appending those 
remote box data needed to calculate the accelerations on 
the particles in its domain. 

The tree-setting is performed through a recursive 
method different from the original approach used in BH. 
The method can be outlined as: given a parent box and 
the set of all the particles it contains, the subset of the 
particles enclosed in a given sub-box is found. If it is non- 
empty then the multipolar coefficients of the sub-box, plus 
various parameters, are evaluated and stored into a free 
memory location. Then a pointer in the parent box is set to 
point to such location. This procedure starts from the root 
box and is repeated recursively for any non-terminal sub- 
box. Also for the evaluation of the multipolar coefficients 



the recursive 'natural' approach is used (as in Hernquist 



1987). 



Within the framework of the tree-setting just de- 
scribed, it is important to employ a fast method to check 
whether a particle belongs to a box or not. In this respect, 
we implemented a spatial mapping of the particles, which 
'translates' the coordinates of each of them into one bi- 
nary number (a 'key'), enclosing all the necessary informa- 
tions about which box contains it at any subdivision level. 
Moreover, it is quite easy to get quickly such informations 
using binary operations within a recursive context. Details 
about such method are given in Appendix |X| 

3. The parallelization method 

3.1. Parallel tree-setting and domain decomposition 

As far as the parallel execution of the tree-setting is con- 
cerned, an important feature of the logical data structure 
is that the lower levels of the tree are made up of few but 
highly populated boxes while, on the contrary, at upper 
levels there are many boxes, but containing few particles. 

This suggests two different schemes of work distribu- 
tion to the PEs during this phase. Indeed, in order to 
have a good load-balancing, it is desirable to make the 
number of the 'computational elements' much larger than 
the number p of the PEs which the work has to be dis- 
tributed to. Hence, in our case it is convenient that, on 
one side the parallel setting up of the lower levels of the 
tree is done by assigning to each processor a sub-set of the 
particles belonging to the same box; on the other side, for 
the construction of the upper levels, whole sets of parti- 
cles belonging to distinct boxes should be assigned to each 
PE. Thus, before going any further, let us give some useful 
definitions. Given k > I a fixed integer, we call 

— lower box: a box containing a number of particles n 
such that n > kp; 

— upper box: a box with n < kp; 

— pseudo-terminal (PTERM) box: an upper box having 
a parent lower box. 

Finally, we call 'lower (upper) tree' the portion of the 
whole tree made up of lower (upper) boxes (see Fig. [[]) . 

The parallel tree-setting is then executed in two steps, 
the first step for the construction of the lower tree and the 
second one for that of the upper tree, as described in the 
following Sections. 

3.1.1. Lower tree setting 

In the first step, the particles are initially distributed at 
random to the PEs, i.e. without any correlation with their 
spatial location. Then, the PEs start building the tree, 
working on lower boxes, according to the recursive proce- 
dure described in Sect. ||. They consider, at the same time, 
the same box but dealing, of course, only with their own 
particles. This phase of the tree setting stops at PTERM 
boxes (instead of at terminal ones, as it is done in the 
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serial version) where no more 'branches' are set up. In or- 
der to obtain an efficient parallel execution, the evaluation 
of the multipolar coefficients of (lower) boxes is done di- 
rectly, i.e. by means of summations running over the set of 
particles contained, rather than by the recursive formulas 
involving the sub-boxes coefficients. 

To attain the maximum data-locality, each PE keeps 
a copy of the tree structure in its own local memory. This 
makes that, in the tree-traversal stage, the reading access 
to lower boxesQ data will not be slowed down continuously 
by the inter-processors data transfer, which is one of the 
performances bottlenecks of codes running on distributed 
memory parallel computers. Note also that the amount 
of local storage needed to this part of the tree scales like 
the number of lower boxes, that is like ~ rlogr, being 
r ~ Nikp)- 1 the total number of PTERM boxes. Thus, 
the memory occupation for each local copy of the lower 
tree scales, conveniently, like the number of particles per 
processor. During the current step, the large number of 
particles in lower boxes (provided that k is sufficiently 
great), together with the random particle distribution to 
the PEs, ensures a good work-load balancing. 

At this point a suitable re-distribution of the parti- 
cles (for an efficient tree-traversal) is performed 'on the 
fly' by exploiting our recursive way to set up the logi- 
cal tree structure. Actually, such recursive approach, to- 
gether with the tech nique used in mapping the particles' 
coordinates (see Fig. |A.f| ), leads to a particular order in 
which the PTERM boxes are met. Such order corresponds 
to a one-dimensional path connecting a PTERM box to 



1 They are very frequently met in evaluating particles' accel- 
erations, because they generate the long-range field. 




Fig. 2. Example of PTERM boxes inspection order, for 
a uniform particle distribution in 2-D. The boxes are on 
the 3 rd level of the spatial subdivision and each pattern 
corresponds to a different processor's domain (among 4 
PEs), while the dashed arrows indicate the 'jumps' along 
the path. 

another spatially adjacen^ in a self-similar fashion (see 
Fig. |^). Though similar to that used by WS, the order is 
obtained in a substantially different way, as discussed in 
Appendix |A|. 

This order is such that, by cutting the path into p con- 
tiguous pieces with the same 'length' and then assigning 
to the i-th PE the particles contained into all the PTERM 

2 Actually, along the path there are some discontinuities in 
form of 'jumps' from a PTERM box to another non-adjacent 
one, which could be avoided with a more complicated (non 
self-similar) order, as described in WS. 
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boxes in the z-th piece, one obtains a particularly efficient 
DD. Indeed, it is characterised by having most of sub- 
domains compact in space ^. 

The particles assignement is performed every time a 
PTERM box is met. Moreover, other than the particles 
contained, also the data regarding the box are stored into 
the local memory of the PE. For this reason, it is necessary 
that the pointer in the parent box pointing to the PTERM 
one, includes also the information about which PE owns 
this latter. This way, any other PE can easily accede to the 
box data. The information is included within the pointer 
itself by constructing the full- address of the PTERM box, 
as described in Appendix |b[ 

As in WS, we found that a good load-balancing can be 
attained if one 'measures' the path length in terms of the 
computational loads of the PTERM boxes, defining such 
quantity as the sum of the 'weights' of all the particles 
inside them, being the particle weight proportional to the 
number of bodies (both particles and boxes) whose force 
on it has been evaluated during the tree-traversal of the 
previous time step. 

An important difference with respect to the WS' 
method is that in our scheme the DD is performed via 
a (PTERM) boxes distribution to the PEs, rather then by 
directly distributing particles, and this means an easier 
parallel tree construction of the sub-trees in comparison 
with the HOT method. In this latter, in fact, there are 
several complications in setting up the local parts of the 
tree, such as: the 'broadcasting' of branch boxes, the inter- 
processor exchange of data regarding particles sited at the 
border of sub-domains, etc. In Fig. ^ an example of par- 
ticles distribution to 4 PEs is plotted for a non-uniform 
case. 



3.1.2. Upper tree-setting 

The second step is the construction of the remaining up- 
per part of the tree, which is performed according to the 
same recursive procedure used for lower boxes but, in this 
case, every PE works independently and without synchro- 
nism, starting every time from each PTERM box in its 
own domain. Every PE will store, in its local memory, the 
logical and data structure of all the sub-trees whose roots 
are given by such boxes. For example, in Fig. [j], the PE 
owning the rightmost PTERM box, will set up the sub- 
tree enclosed within the dashed rectangle. Another differ- 
ence, in comparison with the lower tree setting, is that 
all the pointers box — > sub-box adopt the full-addressing, 



The efficiency comes from that, in evaluating the acceler- 
ation on a particle, most of interactions are with the closest 
bodies. In fact, the number density of the bodies satisfying the 
opening criterion at a distance d from the particle is roughly 
oc (fid) (with 9 the open-angle parameter) which decreases 
very rapidly with the distance. Hence a compact sub-domain 
will contain most of such bodies. 





Fig. 3. Example of domain decomposition among four 
PEs, for a cluster represented with 16,384 particles. Top: 
'section' of the 3-D particles distribution lying on the yz 
plane. Bottom: the sections of the 4 sub-domains have 
been spaced for clarity; note how one of them (the grey 
one) is not completely compact. 



because, in principle, any box could be required by other 
processors during the tree-traversal. 

3.2. Parallel tree-traversal 

In this stage, each PE uses exactly the same recursive 
procedure usually adopted in serial tree-codes (see e.g. 
Hernquist 1987) to evaluate the forces acting on the par- 
ticles though, of course, only on those belonging to its 
domain. 

At the beginning of this phase, such domain includes 
both the whole lower tree and those sub-trees whose roots 
are the PTERM boxes assigned to the PE. We can call 
all this set of boxes the initial locally essential tree (LET). 
This latter is not yet "complete" , in the sense that it does 
not include, yet, all the bodies that are necessary to evalu- 
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ate the forces on the particles in the PE's domain, lacking 
some of the upper boxes belonging to remote sub-trees 
(though they are the minority of all the required bodies, 
thanks to the spatially compact DD). 

Anyway, the suitable addressing scheme adopted and 
the belonging of the boxes to the overall tree topology 
(as well as the recursive approach for the tree-traversal), 
allow us to perform the LET completion 'at run time'. 
Given a particle belonging to a certain PE and given a box 
B 6 LET whose sub-boxes have to be handled: if a sub- 
box does not belo ng to the LET, as can be immediately 
recognized by Eq. B^, then (i) get all its data (and all 
its pointers) f rom the owning PE's memory at the address 
given by Eq. B^, (ii) copy them into a free location of 
the local memory, (iii) change the pointer in B to point 
this new local address. This way, the sub-box is included 
into the LET and when any other particle requires it, it 
is already found into the local memory. This mechanism 
minimizes the amount of inter-processor communications. 

Note that the full-addressing mechanism makes remote 
data retrieval immediate, thanks to that the local sub- 
trees are portions of a single and global tree arrangement, 
unlike the orthogonal recursive bisection scheme (Warren 
& Salmon 1992). In our opinion our addressing method is 
as 'global' as that used in WS, though easier to be imple- 
mented and with a lower computational overhead, as we 
will see. 

4. Results 

The efficiency of the parallelization method has been 
checked by analysing the performances for a single eval- 
uation of the forces on a set of particles. Such evaluation 
includes one tree-setting (including DD) and one tree- 
traversal step, as well as all the necessary inter-processor 
communications and remote accesses (LET completion), 
without the time integration of trajectories. Indeed, it is 
known that most of the CPU-time used by a simulation 
of large self-gravitating systems is spent in the computa- 
tion of the gravitational interactions. Usually this latter 
takes at least 80% of the total CPU-time, while the time 
advancing of particles' dynamical quantities (position, ve- 
locity, etc.) takes just ~ 10%, because its computational 
cost scales like N. This cost is even smaller for low-order 
time integration schemes, like those generally used in con- 
junction with tree-codesQ. Moreover, it is generally very 
simple to parallelize time integration methods, because 
the time advancing of a particle is independent of that of 
the others and the corresponding work-load is, normally, 
very homogeneous. For this reason our tests involve the 
forces computation only, which indeed represents the key 
problem to overcome for getting a good parallelization. 

Nevertheless, one has to be careful when time integra- 
tion algorithms adopt individual time steps (that is a de- 

4 The commonly accepted degree of accuracy in the force 
calculation is about one part in 10 2 -10 3 . This makes high- 
order time schemes superfluous (at least in not too dynamic 
situations). 




Fig. 4. The clustered set of 128K particles used in the 
tests. 



sirable feature when dealing with self-gravitating systems 
with a wide range of time scales), because they imply a 
force evaluation which is mostly performed on a subset 
of the entire set of particles. We discuss this problem in 
Appendix ^|. 

All the tests were performed on a set of N — 128K 
equal mass (to) particles distributed according to the 
Plummer profile (known to fit acceptably globular clus- 
ters, at least in region s not too far from the center, 
see Binney & Tremaine (19871) p(r) = p (l + r 2 /r 2 ) _5/2 , 
within a sphere of radius R such to include a particles 



total mass, M — Nm, such that M 
where M ~ — a^^ 2 



0.995 x M a 



- J Q Anr z pdr. The core radius is chosen as 
r c = 6 x 10~ 2 _R, while p = 3M 00 (47rr|?) _1 is the central 
density. This is a highly non-uniform distribution, being 
p /p(R) ~ 10 6 (see Fig. |). 

For the box-particle force evaluation we used multipo- 
lar series truncated at the quadrupoles, and the original 
BH 'opening' criterion with different values for the open- 
angle parameter 9. The maximum number of particle in 
PTERM boxes (see Sect. 3.1) was set equal to 16 x p, this 
value giving the best performances for any number of PEs 
(p), as we checked. The tests ran on a Cray T3E and re- 
garded our parallelization method implemented by means 
of PGHPF/Craft directives. 

4.1. The performances 

The good efficiency of the parallelization approach de- 
scribed in previous Sections is basically shown by three 
facts: i) a behavior of the relative speedup^] close to the 
ideal one (linear in p); ii) a low unbalancing of the work- 
load; iii) a low parallelization overhead^. 

5 The rapidity gain one has when going from a run with one 
PE to the same run with p PEs. 

6 The CPU-time needed by all those instructions which 
would not be necessary in a serial execution. 



6 



P. Miocchi and R. Capuzzo-Dolcetta: An efficient paraiiei tree-code 




Fig. 5. Speedup for the force evaluation on N = 1281^ particles (Plummer profile). Bottom: code speed vs. the error 
on the forces 5a/ a at the 90% percentile (i.e. such that 90% of particles have a force affected by a relative error < da/ a) 
for various number of PEs (p) and different values of 9, as labeled. Top: the speedup with respect to a single PE run 
(in the 9 = 0.7 case) plotted for the overall force evaluation and for the tree-traversal (t.t.) only. 



The good overall code scalability is shown in the upper 
panel of Fig. |5|. It appears to be rather good for p < 16 
(i.e. N/p > 8192). For more than 16 PEs, the perfor- 
mances start to degrade because of the too small number 
of particles per PE: with p < 32 one has < 4096 particles 
per PE that makes the tree-code not that efficient in itself. 
To give immediate indications of how fast the calculations 
are for a given accuracy, we show in the lower panel the 
absolute speed of the code versus the relative error on the 
forces evaluation. The relative error on the evaluation of 
the force (per unit mass) on a particle - due to the use of 
the truncated multipolar expansion for the box satisfying 
the opening criterion - is defined as: Sa/a = \a tc — a\/a, 
where a is the magnitude of the acceleration evaluated by 
means of the 'exact' particle-particle summation, and at c 
denotes the magnitude of the acceleration calculated via 
the tree-code. 

The computational work-load is well balanced among 
the PEs. A natural way to quantify the load unbalancing, 
u, is via the formula u = (i max — i m i n )/ <t>, where t max 
and t m i n are, respectively, the maximum and the minimum 
CPU-time spent by the PEs to perform a given procedure 
and <t> is the averaged CPU-time. From Fig. |, we can 
see that, for a sufficiently high number of particles per 
PE (say for N/p > 8, 000) we have a quite low u, that is 



less than 10% for the tree-setting and always less than 6% 
during the tree-traversal, demonstrating the efficiency of 
the DD. Only for N/p ~ 4, 000 the unbalancing becomes 
unacceptable (> 50%). 

Last, but not least, we can see in Table[l] that the paral- 
lclization overhead takes only 3.2% of the total CPU-time 
in a 8 PEs run. Moreover, as we verified, this percentage 
increases significantly only when N/p < 8, 000. Thus, in 
optimal conditions the 'surplus' of CPU-time specifically 
needed to make the parallclization operative (in our case 
spent by the DD plus the LET completion) is almost neg- 
ligible. This point is crucial in order to state that a parallel 
code is really efficient. 

Indeed, even in a distributed memory context one can 
get a highly scalable tree-code with a good work-load bal- 
ancing, using just an absolutely banal DD. In fact, the 
load balancing can be achieved by means of a 'dynamical' 
distribution^ of the particles to the PEs (see Singh et al. 
1995 ), tolerating a great deal of communications and re- 



mote accesses. Following this approach, we implemented 
another parallel version of the tree-code obtaining a good 
speedup scaling and a low load unbalancing on the T3E, 



7 All the particles to be processed are put in a 'queue'. As 
soon as a PE has finished its previous work, it gets the first 
particle in the queue and evaluates its acceleration. 
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Fig. 6. Load unbalancing parameter (u) for a single force 
evaluation with 9 — 0.7. Solid line: for the tree-setting 
stage; dotted line: for the tree-traversal. 



but the communications overhead heavily affected the ab- 
solute performaces making it not convenient for practical 



use (see Capuzzo-Dolcetta & Miocchi 1998, 1999) 



Table 1. Code CPU-time consumption with 8 PEs, 128K 
particles and 9 = 0.7. Italic: parallelization overhead. 



Code section 


sec 


% 


tree- setting 


1.4 


6 


domain decomposition 


0.1 


0.5 


tree-traversal 


21 


94 


LOWER tree-traversal 


3.3 


15 


UPPER tree-traversal 


18 


80 


LET completion 


0.6 


2.7 


total 


22.4 


100 



Finally, the code memory usage requires roughly 1 
Kbyte per particle. For instance, more than 10 7 particles 
can be handled by 128 processors having 128 Mbyte each. 
Such amount of particles can be furtherly increased in a 
more optimized message passing implementation. 

4.2. Comparisons with other codes 

To make really significant comparisons of the perfor- 
mances of different tree-codes, one should ensure the forces 
computation be done with the same accuracy and on the 
same set of particles. Of course, such performances depend 
also on the opening criterion adopted, because at a given 
accuracy and with a given set of particles, different open- 
ing criteria can give different amounts of interactions to 
evaluate on a particle, thus giving different computation 
speeds. Therefore, if one wants to compare specifically the 




0.001 - 



200 400 600 800 1000 1200 

<w> 

Fig. 7. Relative error on forces evaluation (at 90% per- 
centile) versus the averaged computational work, using the 
BH opening criterion. The corresponding values for the 
open-angle parameter 9 are labeled. 



efficiency of the parallelization approach, then the tests 
should be done with the same opening criterion tooQ. 

Unfortunately, it is often very difficult to make such 
conditions hold with the tree-codes available in the liter- 
ature. For this reason we decided to compare codes speed 
at a given amount of computational work done to evalu- 
ate forces. This makes the comparison independent of: the 
particles distribution, the number of particles, and the ac- 
curacy (i.e. the opening criterion and its parameters). In 
tree-codes the amount of numerical work done on a given 
particle, w%, is naturally quantified as the number of 'in- 
teractions' evaluated to estimate the force on it (as in the 
particle work- load definition of Sect. |3.1.l[ ), namely the 
total number of bodies (both boxes and single particles) 
of which the tree-code evaluates the force they exert on 
the particle itself (in a particle-particle method one would 
have uii = N — 1). 

In Fig. ^ we plotted the error on the forces evaluation 
{5a I a), versus the averaged computational work, <w>= 
(J2i W i)/N, needed by our code to evaluate the forces on 
the set of N = 128K particles above-described. This allows 
us to compare 'honestly' our code performances with those 
of other codes. 



In Springel et al. ( 2000 ) the authors tested their tree- 
code (GADGET) on a Cray T3E. They give speed mea- 
surements for a rather clustered cosmological distribution, 
using the BH opening criterion with 9 = 1. In such condi- 
tions their code gives 8a/ a ~ 3.5 x 10~ 2 at 90% percentile, 
with <w>c± 200 interactions per particle. From Fig. (7] we 



8 In general the implementation of the opening criterion does 
not really affect the parallelization method, and it is a rather 
simple task as well. 
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can see that, with the Plummer distribution we used, the 
closest value for <w> is achieved for 9 = 1.2, which gives 
<w>~ 230 interactions per particle and corresponds to 
an accuracy of 5a/a ~ 3 x 10~ 2 . Such accuracy is ob- 
tained in a run that, using e.g. 8 PEs, is performed with a 
speed of 15, 000 particles/sec (as shown in the lower panel 
of Fig. U). Note that such run includes, apart from the 
tree-traversal, also the tree-setting and all the overhead 
needed to a parallel execution. At the same conditions, but 
performing only the tree-traversal stage, GADGET has a 
lower speed: about 13,000 particles/sec (with 8 PEs). It 
is worth noting that it shows a load unbalancing of about 



19% with N/p ~ 3 x 10 , while our code exhibits 



4% 



with the same ratio N/p. One has to say, finally, that 
GADGET has been implemented using message passing 
instructions, certainly a more suitable approach with re- 
spect to that we adopted (see next Section). 

Another comparison can be done with the parallel code 
illustrated in Dubinski (1996). In this case the code ran 
on a Cray T3D and the author employed a more efficient 



modified BH opening criterion (Barnes 1994). Anyway, 
the rapidity of the code is given both in terms of the total 
computational work (N <w>) performed in one second, 
and in terms of particles per second. With 16 PEs such 
code evaluated about 3 x 10 6 interactions/sec, correspond- 
ing to 6,000 particles/sec, for a forces computation per- 
formed on a cluster with N = 1.1M particles. This means 
that the code spent t ~ 180 sec in that run. Hence, being 
N <w> /f~3x 10 6 , it handled <w>~ 500 interactions 
per particle. With our tree-code such <w> corresponds 
to an accuracy of about 5a/ a ~ 4 x 10~ 3 (Fig. Q), and 
so it is performed at ~ 13, 000 particles/sec with 16 PEs 
(Fig. ||). Of course, our speedup factor of 2 is partially due 
to the improved performances of the T3E compared with 
the T3D, even if the direct message passing approach used 
by the author is more efficient then the use of the PGHPF 
compiler. 



4.3. Remarks about the implementation 

The use of PGHPF/Craft directives is certainly not the 
best way to implement a parallel tree-code on a distributed 
memory architecture, but we did so in order to obtain 
quickly a ready-to-use version. Indeed, the directives per- 
mit just to distribute in a simple way loop iterations to the 
PEs, as well as the elements of shared arrays to their local 
memory, without using explicit message passing routines. 

The highest price to pay for such simplicity, is that the 
way message passing operations are actually performed 
cannot be controlled and optimized. In our specific case, 
each PE has to copy all its local upper tree to logically 
shared arrays, in order to enable the other PEs to ac- 
cede to it during the tree-traversal. This means a consid- 
erable waste of memory (and communications) , which can 
be avoided in a direct message passing implementation, so 
to reduce furthcrly the parallclization overhead. 



Similar storage and clock time wastings are due also 
to that the PGHPF compiler is generally not capable to 
recognize local references within a shared array, which are 
handled as if they were remote instead. Anyway, we exper- 
imented also a slow local referencing, due to a non optimal 
cache-memory managing. For these reasons, our next goal 
is the development of an MPI version, which would also 
be easily implementable on different distributed memory 
platforms. 

Finally, we have also carried out an implementation 
suitable for running on a shared memory computer. In this 
case, of course, the DD has to take into account only the 
work-load balancing and a 'dynamic' particles distribu- 
tion can be adopted during the tree-traversal. Anyway, it 
turns out to be worth subdividing the tree into upper and 
lower boxes for a balanced tree-setting. Such implementa- 
tion^ was carried out using OpenMP directives on a SUN 
Enterprise 4500 HPC machine, with 14 PEs. The results 
are very good and, given the same parameters, we veri- 
fied a 40% speedup in comparison with the T3E PGHPF 
implementation. 

Appendix A: Particle mapping 

Particles' locations are mapped converting each coordi- 
nate^], Xi, i = 1,2,3, into an integer triple qi = [xi x 
2 Zmax /Lj, where indicates the truncation to an in- 
teger, L is the root box' size and Z max is the maximum 
subdivision level a priori allowed. Then 91,2,3 are com- 
bined into an integer number (the 'key') Q € [0, 8 imax — 1], 
which is defined as: 

'max 

Q = ^ [ mod (L9i/ fc 'J,2) + 2 x mod(L<z 2 /fciJ,2) + 



z=i 



-4xmod(Lg 3 /fc;J,2)] fei , 



(A.l) 



where fc; = 2 imax ~'. Despite its complicated definition, Q 
can be rapidly evaluated by means of direct bit manipu- 
lation routinesPI available both in FORTRAN and in C. 

Given a particle's key, it is possible to determine 
quickly whether it belongs to a box or not, in a recur- 
sive fashion. Let us indicate the key binary representation 
as Q = {&3i max -i&3i max -2 • • -&2&i&o}2, then: known that the 
particle belongs to a certain box at the level I of the spatial 
subdivision, at the level I + 1 such particle will belong to 
the i-th sub-box (0 < i < 7) which is identified by the oc- 
tal digit i = {bk+2bk+ibk}2- This latter is made up of the 
three adjacent bits of Q from position k — 3(Z max — l — l) 

9 We are thankful to the CASPUR center (sited at the 
Universita di Roma "La Sapienza" ) for the resources provided. 

10 With the origin at a root box' vertex and the axes parallel 
to its edges. 

11 For instance, a multiplication of an integer, n, by 2-*, with 
j > (j < 0), corresponds to shift its binary representation 
by j positions left (right) , whereas mod(n, 2) = the least sig- 
nificant (rightmost) bit. Hence mod(n/2 m , 2), with m > 0, is 
the bit in position m, being the rightmost one in position 0. In 
FORTRAN such bit is given by ibits(n,m,l) 
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Fig. A.l. Top: a parent box (centered at O), has been 
"opened" to show the numbering of its sub-boxes (in 3-D), 
which allows to identify, by suitably extracting octal digits 
from the key Q of a particle, the sub-box which it belongs 
to. Bottom: example of a particle mapping shown in the 
2-D case for simplicity; in this case (l max = 3) each sub- 
box enclosing the plotted particle, is recursively identifies 
by base 4 digits made up of 2 bits. 

to the left. For I = any particle is enclosed in the root 



box. See the example in Fig. A.l 

Each key needs 3Z max bits to be handled. We used 
two long-integers (i.e. 16 bytes), thus allowing l max < 42, 
which is sufficient for most applications. Note that the 
evaluation of the keys of all the particles, with a compu- 
tational cost of order O(N), can be done just once, before 
the tree setting starts. 

The mapping method described in WS is similar, to 
some extent, to that above-illustrated, but in that case the 
authors do not use pointers to reproduce the tree topology, 
they rather use the particles' keys themselves (evaluated 
similarly as in Eq. [A.l| ) plus a hashing function to build an 
addressing space for all the boxes. Roughly speaking, in 
order to reduce the huge number of a-priori possible keys 
(e.g. in 3-D there are 8 imax possible values for Q), they 
truncate the keys binary representation, replacing the in- 
formation losted this way by means of linked lists. In the 
authors opinion this presents mainly the following advan- 
tages: i) it permits a direct access to any boxes, without 
needing a tree-traversal; ii) in a distributed memory con- 



text, it gives a unique addressing scheme for the boxes, 
independently of which PE owns them. We think that the 
advantage in point i) is not important in our implementa- 
tion because, as we have seen, any procedure involving the 
tree structure is performed recursively. A direct access to 
a box does not really speed up the execution; moreover, 
in WS' method the accesses are not really direct (expe- 
cially at upper levels) due to the presence of linked lists. 
As far as point ii) is concerned, also in our parallel ver- 
sion there is an addressing scheme which allows data in 
any sub-domain to be globally referenced (see Sect. 3.1 
and Appendix (bJ) . 

Appendix B: Construction of a global addressing 
scheme 

Given the binary representation of baddr= {b m -ib m -2 • • ■ 
&i&o}2 that is the address location of an upper box within 
the i-th PE's local memory, we define the box full-address 
as the s-bit number (being s the bit size of integer vari- 
ables, excluding the bit used for the sign): 

f addr = {100 • • • 0& m _i& m _ 2 ■ • • bib c 7 c & ■ ■ ■ cic } 2 , (B.l) 

being i= {c 7 cq ■ ■ ■ ciCo}2- Note that the leftmost bit, in 
position s — 1, is set to 1 in order to indicate that the box 
pointed is an upper one. Note also that 8 bits are used for 
i, allowing p < 256. To make the full-address be a single 
integer number it is necessary that m + 8 < s. Hence with 
8-bytes integers (s = 63) the local addressing is limited by 
2 55 that is sufficient for any purpose. 

In FORTRAN such 'bit concatenation' can be easily 
carried out this way: 

faddr = Ior(Ior(Ishft(baddr,8) ,i) ,mask), (B.2) 

where mask= 2 S_1 . Inversely, given the full-address of a 
box, the following bit manipulation instructions give the 
address in the local memory of the i-th processor which 
it belongs to: 



i = land (faddr ,mask2), 
baddr = Ishf t (Iand(f addr ,maskl) , -8) 

being maskl=mask— 1 and mask2= 255. 



(B.3) 
(B.4) 



Appendix C: Parallel time integration 

As far as the time integration is concerned, when the time 
advancing algorithm adopts individual time steps, at a 
given time the forces are evaluated only on a sub-set of 
all the particles. This leads to a work-load as more un- 
balanced as the sub-set is smaller. We experimented vari- 
ous possible solutions for such problem, in particular in 
the case of the leap frog algorithm wit h the block-time 
scheme (Porter 1985 ; Henquist & Katz 1989 ). According 
to this scheme, the z-th particle occupies the 'time bin' 
bi 6 {0, 1, 2, &max}, in the sense that it gets a time step 
Atj = r/2 bi , where r is the maximum time step allowed 
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(usually a fraction of a significant time scale for the sys- 
tem) . The simulation time (t) is advanced by the minimum 
time step used and, at a given t, the accelerations are eval- 
uated only on synchronized particles, i.e. on those whose 
acceleration was calculated last time at f — Atj (at t = 
all the particles accelerations are evaluated). According to 
some rules, Ati can also change with time. 

First, we found convenient to set the maximum a priori 
possible bin not that high, say 6 max < 6, so to avoid bins 
with too few particles within. This choice is also recom- 
mendable because of the intrinsic non-symplectic nature 
of the individual time step leapfrog. Indeed, every time 
a particle changes its bin, a lost of time symmetry oc- 
curs, leading to instability and to a long term energy drift. 
Furthermore, good results are obtained if, at a given in- 
stant, one assignes to synchronized particles a weight (w) 
much greater than those assigned to the others. One could 
be even tempted to set w = for non-synchronized parti- 
cles because no work will be done, in the current step, 
to update their accelerations. Nevertheless, this would 
give rise to a very unbalanced work-load during the tree- 
setting, because wide sets of PTERM boxes (containing 
zero-weight non-synchronized particles) may be assigned 
to the same PE, forcing it to build large portions of the 
upper tree. We found that a good compromise is to mul- 
tiply the weight of the currently synchronized particles 
- initially estimated as described in Sect. |3.1.l| - by the 
factor N/s, being s the number of these latter. 

This mantains the unbalancing comparable with that 
of the single force evaluation on all the particles, also in 
those highly dynamic situations in which the accelerations 
of the particles moving through very dense region (like for 
pairs of stars during close encounters occurring within the 
core of a globular cluster) need to be frequently updated. 



Warren, M.S., & Salmon, J.K. 1993, A Parallel Hashed 
Oct-Tree N-Body Algorithm in Supercomputing '93 (Los 
Alamitos: IEEE Comp. Soc), 12 
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